Fast Nonparametric Machine Learning Algorithms for High-dimensional Massive Data and Applications

نویسندگان

  • Ting Liu
  • Martial Hebert
  • Jeff Schneider
  • Trevor Darrell
چکیده

Nonparametric methods have become increasingly popular in statistics and probabilistic AI communities. One well-known nonparametric method is “nearestneighbor”. It uses the observations in the training set T closest in input space to a query q to form the prediction of q. Specifically, when k of the observations in T are considered, it is called k-nearest-neighbor (or k-NN). Despite its simplicity, k-NN and its variants have been successful in many machine learning problems, including pattern recognition, text categorization, information retrieval, computational statistics, database and data mining. It is also used for estimating sample distributions and Bayes error. However, k-NN and many related nonparametric methods remain hampered by their computational complexity. Many spatial methods, such as metrictrees, have been proposed to alleviate the computational cost, but the effectiveness of these methods decreases as the number of dimensions of feature vectors increases. From another direction, researchers are trying to develop ways to find approximate answers. The premise of this research is that in many cases it is not necessary to insist on the exact answers; instead, determining an approximate answer should be sufficient. In fact, some approximate methods show good performance in a number of applications, and some methods enjoy very good theoretical soundness. However, when facing hundreds or thousands dimensions, many algorithms do not work well in reality. I propose four new spatial methods for fast k-NN and its variants, namely KNS2, KNS3, IOC and spill-tree. The first three algorithms are designed to speed up k-NN classification problems, and they all share the same insight that finding the majority class among the k-NN of q need not to explicitly find those k-nearest-neighbors. Spill-tree is designed for approximate k-NN search. By adapting metric-trees to a more flexible data structure, spill-tree is able to adapt to the distribution of data and it scales well even for huge highdimensional data sets. Significant efficiency improvement has been observed comparing to LSH (localify sensitive hashing), the state of art approximate k-NN algorithm. We applied spill-tree to three real-world applications: shot video segmentation, drug activity detection and image clustering, which I will explain in the thesis.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Nonparametric Learning in High Dimensions

This thesis develops flexible and principled nonparametric learning algorithms to explore, understand, and predict high dimensional and complex datasets. Such data appear frequently in modern scientific domains and lead to numerous important applications. For example, exploring high dimensional functional magnetic resonance imaging data helps us to better understand brain functionalities; infer...

متن کامل

Fast generalized subset scan for anomalous pattern detection

We propose Fast Generalized Subset Scan (FGSS), a new method for detecting anomalous patterns in general categorical data sets. We frame the pattern detection problem as a search over subsets of data records and attributes, maximizing a nonparametric scan statistic over all such subsets. We prove that the nonparametric scan statistics possess a novel property that allows for efficient optimizat...

متن کامل

Scalable machine learning for massive datasets: Fast summation algorithms

Most state-of-the-art nonparametric machine learning algorithms have a computational complexity of either O(N) or O(N), where N is the number of training examples. This has seriously restricted the use of massive data sets. The bottleneck computational primitive at the heart of various algorithms is the multiplication of a structured matrix with a vector, which we refer to as matrix-vector prod...

متن کامل

Machine learning algorithms for time series in financial markets

This research is related to the usefulness of different machine learning methods in forecasting time series on financial markets. The main issue in this field is that economic managers and scientific society are still longing for more accurate forecasting algorithms. Fulfilling this request leads to an increase in forecasting quality and, therefore, more profitability and efficiency. In this pa...

متن کامل

Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues

Regularization is a widely used technique throughout statistics, machine learning, and applied mathematics. Modern applications in science and engineering lead to massive and complex data sets, which motivate the use of more structured types of regularizers. This survey provides an overview of the use of structured regularization in high-dimensional statistics, including regularizers for group-...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006